
feat: Whisper prompting #22496

Merged
amyeroberts merged 35 commits into huggingface:main from connor-henderson:whisper-prompting
May 19, 2023

Conversation

@connor-henderson
Contributor

connor-henderson commented Mar 31, 2023

What does this PR do?

Closes #22395, thank you @sanchit-gandhi for the descriptive ask!

Note: due to initial scope expansion the commit history includes initial work towards condition_on_previous_text, always_use_initial_prompt, and pipeline integration, but these efforts have been pushed to a later PR

This pull request adds 3 new functionalities + tests to support initial prompting within Whisper's model.generate() and tokenizer:

  • prompt_ids param for model.generate():
    • Optional param of initial prompt ids used to provide context for each chunk of text generated in model.generate()
  • get_prompt_ids processor method to create initial prompt ids from a passed-in string, to be passed to generate
  • Removal of the prompt during tokenizer decoding when skip_special_tokens=True

Example new API usage:

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
# input_speech: a raw audio array (e.g., a sample from a speech dataset)
input_features = processor(input_speech, return_tensors="pt").input_features

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0]))
# "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0]))
# "<|startofprev|> Leighton<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Leighton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings. (I haven't added docs anywhere outside of documenting the new generate() arg directly on the function.)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sanchit-gandhi

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 31, 2023

The documentation is not available anymore as the PR was closed or merged.

@connor-henderson connor-henderson marked this pull request as ready for review March 31, 2023 19:21
@hollance
Contributor

hollance commented Apr 3, 2023

Hey this PR looks really good (although I'll leave the actual review to Sanchit or Arthur).

I was just wondering whether it also makes sense to support the condition_on_previous_text option that the OpenAI repo has, since that uses the same mechanism (using the <|startofprev|> token).

In addition, there's this PR that suggests an always_use_initial_prompt option that uses the prompt on every segment, not just the first. Might be useful to consider that here as well.

@connor-henderson
Contributor Author

> Hey this PR looks really good (although I'll leave the actual review to Sanchit or Arthur).
>
> I was just wondering whether it also makes sense to support the condition_on_previous_text option that the OpenAI repo has, since that uses the same mechanism (using the <|startofprev|> token).
>
> In addition, there's this PR that suggests an always_use_initial_prompt option that uses the prompt on every segment, not just the first. Might be useful to consider that here as well.

Hey Matthijs, thanks, I'm happy to add what's wanted. Will look for HF guidance on that and on whether it should be added here or in a follow-on PR. temperature was another factor I saw in the Whisper model: if it was > 0.5, no prompt tokens were added (link).

Comment on lines 1618 to 1642
Contributor Author

Replacing this with an actual conditional now. Any idea how the model test I added passed with this?

Contributor Author
connor-henderson Apr 3, 2023

I don't think this last line handles all possible datatypes of token_ids, particularly int, torch.Tensor, and ndim > 1 np arrays. Maybe we should use to_py_obj above it first?

Contributor

Maybe it's not needed to check for has_initial_prompt and simply always skip everything until the bos_token?

Contributor Author
connor-henderson Apr 4, 2023

Oh good idea will use the bos_token instead of the prompt start token. Regarding always skipping, do we want to show the prompt when skip_special_tokens is False like in this example or no?

output = processor.batch_decode(pred_ids, skip_special_tokens=False)
# output: ['<|startofprev|> Mr. Quilter<|startoftranscript|><|en|><|transcribe|><|notimestamps|> On the general principles of art, Mr. Quilter writes with equal lucidity.<|endoftext|>']

Contributor Author

Looking into this it appears the bos_token is <|endoftext|> unless otherwise set, which we couldn't use for slicing

Contributor
hollance left a comment

Hi @connor-henderson, I was asked by @sanchit-gandhi to do a code review for your PR. It looks pretty good already, just needs a bit of fine-tuning. I'm only just getting familiar with the Whisper code myself, so my opinions don't necessarily always make sense. ;-)

Contributor

Sanchit suggested the argument name prompt_ids rather than initial_prompt_ids and I have to agree with that; the word prompt already implies that it precedes what happens.

I'm also wondering if this should be Optional[torch.Tensor] rather than List[int], just like input_ids in the HF NLP models. It feels "wrong" to use a list here (since we generally always use Tensors for tokenized input), even though it makes sense with how this variable is used. Maybe @sanchit-gandhi has an opinion about this?

Contributor Author

Sounds good, renamed to prompt_ids.

With regards to the type, I wonder if this is ok here because prompt_ids' purpose is just to update forced_decoder_ids and decoder_start_token_id, which are type int and List[List[int]] (if I'm not mistaken, saw that here)? If we used torch.Tensor, we would also have to import torch in the file with the get_prompt_ids function, and I don't believe we'd ever require tensor functionality. Just lmk whichever you prefer

Contributor

What kind of window does this refer to? (I'm assuming what's called window here is what we call a chunk. If that's the case we should be consistent with the terminology.)

Contributor Author

The "first window" refers to the initial segment of the input speech, so I believe it can be used interchangeably with chunk yes. Updated the wording to use 'chunk'

Contributor

Nitpick: The forced_decoder_ids use a tuple instead of a list, so this should be a tuple too:

indexed_initial_prompt_ids = [(rank + 1, token) for rank, token in enumerate(initial_prompt_ids)]

I also used rank instead of idx and token instead of id, to be consistent with tokenizer.get_decoder_prompt_ids.

Contributor

I'm also not 100% sure that generation_config.forced_decoder_ids is always filled in here. For example, if we do the following:

forced_decoder_ids = processor.get_decoder_prompt_ids(language="de", task="translate")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

then generation_config.forced_decoder_ids may not have the <|de|> token at this point. In this situation, the correct forced_decoder_ids will be filled in by super().generate(...), as far as I can tell.

Contributor Author

Ah good catch! Yes, before, the implementation was overriding provided forced_decoder_ids, but this should be fixed now. Previously it was adding these from the generation_config only, but now it checks kwargs first.

Contributor

I prefer get_prompt_ids() as the name for this method.

Not sure that it needs to live in the tokenizer (since that involves duplicate code for the fast tokenizer). Perhaps just having it in the processor is enough?

Contributor Author

Cool will rename and move it to just being on the processor. My thinking was that users would also want to access the method on the tokenizers directly, but I wasn't sure.

Contributor

Perhaps this could be an instance variable.

Contributor Author

Now that we've updated this decode function with your above suggestion this line was removed. This same logic is now only used once and it's on the processor. Would you still like me to make it an instance var there?

Contributor

If it's just used once then it's not really worth making it an instance variable.

Contributor

Maybe it's not needed to check for has_initial_prompt and simply always skip everything until the bos_token?

Contributor

What is the reason for doing " " + text.strip()?

Contributor Author

This is what was done in the Whisper implementation. I believe the model expects whitespace to be stripped from the beginning and end of the string, with a single space added at the start.
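
As a small illustration of that normalization (assumed behavior, shown for clarity rather than taken from the diff):

text = "  Leighton  "
normalized = " " + text.strip()  # -> " Leighton"
# Whisper's byte-level BPE treats a leading space as part of the first word,
# so " Leighton" and "Leighton" tokenize to different ids.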

Contributor

Small typo in the function name. :-) But excellent work on adding these unit tests!

@connor-henderson
Contributor Author

connor-henderson commented Apr 5, 2023

To-do list before re-requesting review

  • Converting the prompt token to an ID in an instance variable gives an incorrect ID, unlike when it's called in decode
    --Given we're only using it in two places and it's an inexpensive op to call convert_tokens_to_ids I've left this, at least for now, to focus more on the below
  • Bug I found where if the ending text of the prompt matches the start of the transcribed text, that text will not be included in the transcription output. Example:
    --I'm actually not sure this is a bug now. The model has learned to avoid repeating itself, and this only happens if the end of the prompt matches the beginning of the transcription almost exactly. It also appears to be happening inside the model itself as opposed to in the logits processing or other modification before / after.

[Screenshot of the example output omitted]

Added from @hollance's below two comments:

  • Add always_use_initial_prompt and condition_on_previous_text options to pipeline and model.generate()
  • Add prompting functionality to the automatic-speech-recognition pipeline

@hollance
Contributor

hollance commented Apr 5, 2023

One more thing we'll need to do, is change the automatic-speech-recognition pipeline so that it will actually call model.generate() with the prompt, but only for the first chunk (or always if we also decide to support an always_use_initial_prompt option). This logic cannot be part of the modeling code, as model.generate() has no knowledge of which chunk of audio it's processing.

@hollance
Contributor

hollance commented Apr 5, 2023

I looked a bit more into how this works today, and it turns out that 🤗 Transformers does things a bit differently than the original OpenAI code.

OpenAI does the following:

For the first 30-second chunk of audio, it passes the following token sequence to the model's decoder on the first iteration: <|startofprev|> initial prompt<|startoftranscript|><|en|><|transcribe|>. And then it decodes the rest of the sequence autoregressively.

Then for the second chunk of audio, it passes the following sequence to the decoder on the first iteration: <|startofprev|> initial prompt output of the first chunk<|startoftranscript|><|en|><|transcribe|>.

For the next chunk, it uses <|startofprev|> initial prompt output of the first chunk output of the second chunk<|startoftranscript|><|en|><|transcribe|>

And so on... This list of tokens that it passes in the <|startofprev|> section grows longer and longer with each new chunk.

(When you set the condition_on_previous_text option to False, it only uses the output from the previous chunk instead of the complete history. In that case the initial prompt text is only used for the very first chunk.)

Our ASR pipeline works quite differently. It also splits up the audio in 30-second chunks but they partially overlap, and then it runs the model on these chunks in parallel. That makes it impossible to pass the previous context to these chunks, as each chunk is processed independently. So we have no way of sending <|startofprev|> initial prompt output of the first chunk<|startoftranscript|><|en|><|transcribe|> to the second chunk.

The best we can do is send <|startofprev|> initial prompt<|startoftranscript|><|en|><|transcribe|> to the very first chunk only, or always send it to all chunks. So we ignore the "previous context" part and always include the prompt. (The latter would do the same as this open PR on the OpenAI repo for always passing the initial prompt inside <|startofprev|> instead of the previous context.)

The suggested modifications to model.generate() in this PR make it possible to have both initial_prompt and the condition_on_previous_text options as in OpenAI, but it would require the user to write their own processing loop to get the same results as OpenAI. So we should definitely continue with this PR, but if we also want to support initial_prompt in the pipeline we'll have to decide on which approach we want. (It's not possible to have condition_on_previous_text in the current pipeline.)
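
To make that concrete, a minimal user-side sketch of such a sequential loop, built on the prompt_ids support added in this PR, could look like the following (chunked_input_features is a hypothetical list of pre-chunked, non-overlapping 30-second features; this is an illustration, not the pipeline implementation):

initial_prompt = "Leighton"
previous_text = ""
transcript = []

for chunk_features in chunked_input_features:
    # prepend the initial prompt plus everything transcribed so far as the <|startofprev|> context
    prompt_ids = processor.get_prompt_ids(initial_prompt + previous_text)
    pred_ids = model.generate(chunk_features, prompt_ids=prompt_ids)
    chunk_text = processor.decode(pred_ids[0], skip_special_tokens=True)
    transcript.append(chunk_text)
    previous_text += chunk_text  # condition_on_previous_text: the context grows with each chunk

print("".join(transcript))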

@hollance
Contributor

>   • We can provide a prompt in the pipeline like the below without modifying the pipeline at all, works for me locally. Is this sufficient / what you had in mind?

You are correct that when you do the following,

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")
res = pipe(samples, generate_kwargs={ "prompt_ids": prompt_ids })

the pipeline will automatically pass the prompt_ids to model.generate(). However note that this pipeline only processes the first 30 seconds of the audio file. This is fine for audio that is shorter than 30 seconds.

However, to process an audio file that is longer than 30 seconds, we have to do:

res = pipe(example, generate_kwargs={ "prompt_ids": prompt_ids }, chunk_length_s=30, stride_length_s=[6, 0])

Now the same prompt_ids are passed to model.generate() for each 30-second chunk. In effect, this is the always_use_initial_prompt option.

To get the regular initial_prompt (i.e. always_use_initial_prompt disabled) and condition_on_previous_text behavior as they work in OpenAI with the current pipeline, we'd have to pass in a stride_length_s=[0,0] and batch_size=1 to make the loop work sequentially rather than in parallel, and somehow keep track of the previous outputs.

connor-henderson changed the title from "feat: Whisper initial prompting" to "feat: Whisper prompting" on Apr 14, 2023
@connor-henderson
Contributor Author

connor-henderson commented Apr 14, 2023

Ok the additional requested features are now added so I believe this is ready for re-review. Thank you for your comments!

> However note that this pipeline only processes the first 30 seconds of the audio file. This is fine for audio that is shorter than 30 seconds... In effect, this is the always_use_initial_prompt option.

I think I’m missing something here as I’ve tried this on >1 min of audio in the below example where I also added a debug line to decode the tokens inside of the pipeline as they were generated, and it appears to be properly sequential. In any case, if we don’t want this I’ll remove condition_on_previous_text from the pipeline just lmk!

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")
res = pipe(samples, generate_kwargs={ "condition_on_previous_text": True, "prompt_ids": prompt_ids })
# ['<|startofprev|><|startoftranscript|><|en|><|transcribe|><|notimestamps|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|endoftext|>']
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Nor is Mr. Quilter's manner less interesting than his matter.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man<|endoftext|>"]
# ["<|startofprev|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca. Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man<|startoftranscript|><|en|><|transcribe|><|notimestamps|> it is obviously unnecessary for us to point out how luminous these criticisms are, how delicate and expression.<|endoftext|>"]
# ["<|startofprev|> middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year with Christmas and roast beef looming before us, similarly drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca. Lennils, pictures are a sort of upguards and atom paintings and Mason's exquisite itals are as national as a jingo poem. Mr. Berkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo or a turkish bath. Next man it is obviously unnecessary for us to point out how luminous these criticisms are, how delicate and expression.<|startoftranscript|><|en|><|transcribe|><|notimestamps|> On the general principles of art and Mr. Quilter writes with equal lucidity.<|endoftext|>"]

> The suggested modifications to model.generate() in this PR make it possible to have both initial_prompt and the condition_on_previous_text options as in OpenAI, but it would require the user to write their own processing loop to get the same results as OpenAI.

Aimed to address this with the new sequential loop over chunks of the input. Right now this way is incompatible with return_dict_in_generate=True, as I wasn't sure how / if we'd still want to return several ModelOutputs; looking for guidance here.


Also, there are hacks in a few places related to getting the id of the prompt start token and separating it from the prompt text ids. Would this be something we could add to the model or generation config?

Contributor
hollance left a comment

Thanks for working on this feature, @connor-henderson. I think you were very inventive in coming up with a solution that allows us to use initial_prompt and condition_on_previous_text as in OpenAI. 😄

However, your implementation doesn't seem to fit in very well with the current design of Transformers. I'll let my colleagues at HF weigh in too, but it might be better to split this functionality as follows:

  1. Add the prompt_ids to model.generate() as in your earlier version of the PR. All this does is insert the prompt in the <|startofprev|> section. This doesn't give us the OpenAI functionality yet, it only adds <|startofprev|> support to the modeling and tokenizer code.

  2. Create a new pipeline that is specific to Whisper that works more like the OpenAI inference code does. The logic for managing the <|startofprev|> section then sits in the new pipeline's loop, not in the model.

(Perhaps step 2 could be a separate PR, to keep the complexity of these PRs down a bit.)

Comment on lines 1674 to 1684
Contributor

Adding a loop in model.generate() is a clever solution to get the pipeline to work sequentially, but it's also a bit hacky. I don't think it's the right approach for Transformers.

Comment on lines 95 to 99
Contributor

Rather than returning the prompt_ids as a list of integers, it would be preferable to have them as a tensor. But even better, get_prompt_ids() should use the return_tensors argument just like tokenizer does, so that the caller can decide between numpy or torch tensors, or a list of integers.
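
For example, the suggested signature would let the caller pick the container (a hedged sketch of the proposed API, mirroring the tokenizer's return_tensors argument):

prompt_ids = processor.get_prompt_ids("Leighton")                       # default container (e.g. a list / numpy array)
prompt_ids = processor.get_prompt_ids("Leighton", return_tensors="pt")  # torch tensor, ready for model.generate()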

@amyeroberts
Contributor

cc'ing in @gante re generate

@connor-henderson
Contributor Author

>   1. Add the prompt_ids to model.generate() as in your earlier version of the PR. All this does is insert the prompt in the <|startofprev|> section. This doesn't give us the OpenAI functionality yet, it only adds <|startofprev|> support to the modeling and tokenizer code.

Thanks @hollance, I definitely agree splitting this into >1 PR is ideal; I've pushed the code for number 1 above back up so this PR can just address that portion. It now implicitly does always_use_initial_prompt.

@connor-henderson
Contributor Author

Curious if, by adding return_tensors to get_prompt_ids, you're setting up to effectively do condition_on_previous_text via cleverly feeding batches / prompts to model.generate() calls (i.e. the first chunk of a second model.generate call would use the text from the first chunk of the first model.generate call as a prompt, and so on for each chunk in the batch), but that's more of a question for subsequent PRs

@hollance
Contributor

The reason I asked for the return_tensors argument is that passing the prompt_ids into model.generate() as a torch.LongTensor instead of List[int] is more consistent with how we normally pass tokens into Transformers models. I understand that inside the model you might need to turn it into a list anyway for the forced_decoder_ids, but that's really an internal implementation detail. When we generate, the output token sequence is also a Tensor, and so we can concat this to the previous prompt_ids to create the next one, etc. I hope that makes sense. :-)

Contributor
gante left a comment

The PR LGTM as it is, thank you for the contribution 🙌

BTW, the code changes do not match the description at the top. From what I gathered in the comments, there will be a follow-up PR, correct? In that case, would you mind updating the PR, before I tag a core maintainer for a final review? :)

Contributor

Is there a reason behind this slicing? Intuitively it makes sense to me, but I'm curious to know if there is a reference behind this choice :)

Contributor Author

Sure I'll leave a comment in the code too, this is done to match Whisper's implementation. I believe the reason they do the -1 is to make room for the first token to generate, and the reason they do // 2 is to halve it to share context space with a prefix if one is provided (which also gets halved). I don't believe there's prefix support yet in transformers so technically the // 2 isn't necessary at this point but I didn't want to confuse future work around that if it happens. There's a good clarification of prompt vs prefix here if it's of interest.


Hello @connor-henderson, as I am using the prompting feature I noticed a bug for long prompts. It might be caused by the slicing, where it should be text_prompt_ids = text_prompt_ids[-(self.config.max_length // 2 - 1) :], to correctly account for the first token <|startofprev|>.

Contributor

Hey @Helene-Maxcici, feel free to open a new issue to track this bug, tagging myself (and optionally @connor-henderson). In particular, it would be super helpful to have a reproducible code snippet to emulate the behaviour locally. See the following page for details: https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#submitting-a-bug-related-issue-or-feature-request

Contributor
hollance left a comment

This PR is shaping up nicely, @connor-henderson! I think this PR has the right amount of changes and then we can figure out how to do the sequential generation in a follow-up PR.

I've added a bunch of remarks and suggestions so we can make this fit as well into Transformers as possible. 😄

I'd also like to invite my colleagues @sanchit-gandhi and @ArthurZucker to have a look at these changes.

Contributor

My suggestion is prompt_ids: Optional[torch.Tensor] = None but I'll let my colleagues weigh in too. @sanchit-gandhi @ArthurZucker

Contributor Author

Changed to this suggestion

Comment on lines +1639 to +1648
Contributor

Nice! I think this now supports the different ways that forced_decoder_ids may be passed in?

  1. Through model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language=..., task=...)

  2. Through model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

  3. Through model.generate(input_features, language=..., task=...)

It would be good if there are unit tests for these different methods.

Contributor Author
connor-henderson May 2, 2023

I don't believe model.generate allows passing in task or language directly as in 3. above, but I've now added tests for the other two

Contributor

It does allow that (and I think it might even be the preferred method now) but for some reason the language needs to be the token, such as "<|ja|>" rather than "ja".

Contributor

cc @gante

Contributor
gante May 17, 2023

@connor-henderson update on the language code: we now support passing the language token, the language code, or the language name. See this (very recent) PR :)

(not sure if this info has gotten to you, many conversations in parallel in this PR)

Contributor

Note that the language code change was @connor-henderson's most recent PR! This forced_decoder_ids logic is in place so that the code is backward compatible with our previous way of handling the language/task, where we either set it in the config as config.forced_decoder_ids, or explicitly as forced_decoder_ids to the generate method (see #21965 (comment))

Contributor

haha derp, I didn't look at the author 🙈 my bad!

Contributor

Not sure I'm happy with this since token_ids is also used below in the call to super().decode(...).

Contributor Author

Yea I agree, moved this prompt removal code after that super().decode(...) call to _decode so this conversion isn't necessary

Comment on lines 593 to 598
Contributor

Edge case: prompt_end_idx is not set when token_ids has length 1. Maybe rewrite it to this:

if skip_special_tokens and isinstance(token_ids, list):
    prompt_token_id = self.convert_tokens_to_ids("<|startofprev|>")
    if prompt_token_id in token_ids:
        for i in range(1, len(token_ids)):
            if token_ids[i] in self.all_special_ids:
                token_ids = token_ids[i:]
                break

Although perhaps it's easiest to check if the very first token is <|startofprev|> rather than doing prompt_token_id in token_ids?

Comment on lines 299 to 309
Contributor

Since this is the same logic as in the regular tokenizer, maybe we can extract it into a shared helper function?

Contributor Author

moved some of the functionality into a helper and left some for what I think is the right reusability/readability tradeoff, lmk if you think more should be abstracted

Contributor

I can understand putting this into a free function but if it's only used in one class, we generally keep it as a member function.

Contributor Author

For sure, removed this change. I'd been moving the tests around and at one point I had two classes using this.

Comment on lines 1410 to 1482
Contributor

Nice test! I'd like to see a few more tests where you also change the forced_decoder_ids (see my comment above). The way the forced_decoder_ids get passed around is a bit brittle (due to the code for that changing a few times) and so we should make sure we have solid tests, since it's too easy for someone to change how this works and inadvertently break something.

Contributor

Could you also add some tests for edge cases?

For example: processor.get_prompt_ids("") or processor.get_prompt_ids("<|startofprev|> Mr. <|startofprev|> Quilter")

Contributor Author
connor-henderson May 3, 2023

The second will definitely confuse the model and decoding if they were passed to the current get_prompt_ids as is, would you prefer we strip the prompt start token or raise an error that it was included? I'll push up a change that strips it for now, lmk which you prefer and if you'd want to log a warning as well

Contributor

I don't really know what would be the best approach here, was just trying to think of things that might go wrong. ;-)

Perhaps raising an error on unexpected input is the best choice, but only if it doesn't add a lot of complexity to the code.

Contributor Author

Looks like they have their tiktoken package handle it and it raises an error if any special token is included, so will look to do the same

Contributor Author

I feel like this is most readable / simplest as one test with comments clarifying the cases, lmk if you want them split into separate unit tests

Contributor

I agree it's very readable but there's a potential issue: the model will keep state around, i.e. it will change the model.generation_config object with the new forced_decoder_ids and this may affect the next test. So I think it's better to instantiate a new model before each test.

Maybe it's also a good idea to test what happens after you do the following, just to make sure the code can handle both of these things being None:

model.config.forced_decoder_ids = None
model.generation_config.forced_decoder_ids = None

Contributor Author

Added a case for the above, which involved a change in generate I'll call out below. I was aiming to order the tests to prevent conflicting state issues, but you're right, they're more brittle that way; split them into individual tests.

Contributor

Sorry, I wasn't able to fully understand the last comment - for testing the case when:

model.config.forced_decoder_ids = None 
model.generation_config.forced_decoder_ids = None

is this tested?

Contributor Author

Sorry, moving parts: we had a test explicitly for this when there were 5 test cases. Then we trimmed them; per this #22496 (comment) I changed the test_generate_with_prompt_ids_and_no_non_prompt_forced_decoder_ids test to use whisper-base.en and return_timestamps=True. I just tested it though and realized that combination didn't actually set those attributes to None, so I updated the test to explicitly set those two to None.

tl;dr it was tested, then wasn't, now is again

Contributor Author

Went back to using an 'in' check here instead of indexing to 0 in token_ids, since that errors if it's empty.

Contributor

The reason I suggested putting it inside a separate if skip_special_tokens: is that the in operation needs to scan through the list, which is slow, and we can avoid it if skip_special_tokens is False.

Contributor Author

Good point, I'll adapt it to check index 0.

Contributor
hollance left a comment

We're slowly getting there. 😄 (The reason I'm being so nitpicky is that we're going to be changing a model that's already being used a lot, so we have to be very careful to make the right decisions.)

Comment on lines 1531 to 1532
Contributor

Doc comment still refers to the old type hints.

Comment on lines 97 to 99
Contributor

Would it be better to strip out all special tokens / raise an error?

Not sure what OpenAI does here. In "condition on previous text mode" they don't include the <|startoftranscript|><|en|><|transcribe|><|notimestamps|> tokens when they put the previous text in the <|startofprev|> section (since that would be problematic). But I'm not sure if they also strip out the actual timestamps such as <|1.5|> etc, that would be worth looking into.

I think get_prompt_ids should not accept any tokens >= processor.tokenizer.eos_token, so we should either strip these out or raise an error. If we do want to allow timestamp tokens, then it should accept tokens > processor.tokenizer.all_special_ids[-1], since that's where the timestamp tokens begin.

Contributor Author

Looks like they have their tiktoken package handle it and it raises an error if any special token is included, so will look to do the same

Contributor Author
connor-henderson May 3, 2023

These are the tokens they raise an error on, so timestamps are included. transformers uses the same time_precision of 0.02, but notably, even though <|1.00|> is caught by OpenAI's special tokens check, any number that doesn't have hundredths-place precision like <|1.0|> isn't. Opted to implement catching any positive decimal number inside the special token brackets.
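
A rough sketch of that kind of check (a hypothetical helper, not the merged implementation):

import re

def _validate_prompt_text(text, special_tokens):
    # reject explicit special tokens as well as anything that looks like a
    # timestamp token, i.e. a positive decimal inside token brackets such as <|1.0|>
    timestamp_like = re.compile(r"<\|\d+(\.\d+)?\|>")
    if any(tok in text for tok in special_tokens) or timestamp_like.search(text):
        raise ValueError("Prompt text should not contain special or timestamp tokens.")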

Contributor

Your solution is probably fine but a simpler approach would be to make sure no token has a value >= processor.tokenizer.eos_token. ;-)

Contributor Author

Oh interesting, for timestamps and <|nospeech|> too? I get several low ids when trying to tokenize a timestamp

[Screenshot of the tokenizer output omitted]

Contributor

Maybe related: #20225.

Contributor

I guess our tokenizer only decodes timestamp tokens, but doesn't know how to encode them?

Timestamps start at 50364. Any token id higher than that is a timestamp token.

Contributor

The reason I suggested putting it inside a separate if skip_special_tokens: is that the in operation needs to scan through the list, which is slow, and we can avoid it if skip_special_tokens is False.

Contributor

Same here. I'd prefer to do the has_prompt check after checking for skip_special_tokens. (It's only a small thing and skip_special_tokens will be True most of the time anyway, so consider this a nitpick. ;-) )

Contributor

I agree it's very readable but there's a potential issue: the model will keep state around, i.e. it will change the model.generation_config object with the new forced_decoder_ids and this may affect the next test. So I think it's better to instantiate a new model before each test.

Maybe it's also a good idea to test what happens after you do the following, just to make sure the code can handle both of these things being None:

model.config.forced_decoder_ids = None
model.generation_config.forced_decoder_ids = None

Contributor Author

This is done solely for handling the case where prompt_ids are passed in but the generation config's and model config's forced decoder ids are both None. It's essentially just changing the order of operations so that we can cleanly check forced_decoder_ids is None and prompt_ids is not None to then add non-prompt forced decoder ids; none of the other functionality should change.

@amyeroberts
Contributor

@AvivSham Thanks for reporting and @connor-henderson thanks for investigating!

I think we're good to merge 👍

@amyeroberts amyeroberts merged commit 2acedf4 into huggingface:main May 19, 2023
@dgram0

dgram0 commented May 20, 2023

Thank you so much for adding this! I've found that I occasionally get the following:

Traceback (most recent call last):
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\models\whisper\modeling_whisper.py", line 1662, in generate
    return super().generate(
  File "G:\Conda\hfwhisper\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\generation\utils.py", line 1518, in generate
    return self.greedy_search(
  File "G:\Conda\hfwhisper\lib\site-packages\transformers\generation\utils.py", line 2345, in greedy_search
    next_token_logits = outputs.logits[:, -1, :]
IndexError: index -1 is out of bounds for dimension 1 with size 0

My workaround is to catch the exception and try again without the prompt_ids.
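
Roughly along these lines (names follow the earlier examples in this thread):

try:
    pred_ids = model.generate(input_features, prompt_ids=prompt_ids)
except IndexError:
    pred_ids = model.generate(input_features)  # retry without the prompt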

@hollance
Contributor

Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.

@hollance
Contributor

> @Johnson-NLP: Is it possible to add 'initial_prompt' in the Fine-Tune code with a 'prompt_use_rate' to control how often to add prompts to the sentences in training sets?

Sounds like an interesting idea. Would you mind opening a new issue for this? Thanks!

@sanchit-gandhi
Contributor

sanchit-gandhi commented May 22, 2023

To get prompting working with fine-tuning, we probably don't want to explicitly add 'prompted' examples per-se, but rather split longer examples up into shorter ones and feed them sequentially through the model, providing previous passages as 'context' to the model.

For example, if we had a training sample that looked like:

This is the first sentence. This is the second sentence. And finally, this is the third.

Currently what we do is feed it to the model all at once:

<|startoftranscript|> This is the first sentence. This is the second sentence. And finally, this is the third. <|endoftranscript|>

What we can do is feed the first sentence in:

<|startoftranscript|> This is the first sentence. <|endoftranscript|>

Then the second sentence, with the first sentence as context:

<|startofprev|> This is the first sentence.<|startoftranscript|> This is the second sentence. <|endoftranscript|>

And then the third, with both the first and second sentences as context:

<|startofprev|> This is the first sentence. This is the second sentence.<|startoftranscript|>  And finally, this is the third.<|endoftranscript|>

At inference time, we then just provide the "context" as our prompts:

<|startofprev|> This is the prompt.<|startoftranscript|> (model generates the rest)

See section 2.3 of the Whisper paper for an in-depth explanation as to how they achieve this during pre-training. We essentially want to do the same for fine-tuning.

For this to work, ideally we need an original sentence that is >> 30s in duration. That way when we split it up, we don't have super short examples that we feed to the model.
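
For concreteness, a minimal sketch of the label layout described above (illustrative of the token ordering only, assuming a WhisperTokenizer instance named tokenizer; not a full fine-tuning recipe):

prev_token_id = tokenizer.convert_tokens_to_ids("<|startofprev|>")

def build_labels(context_text, target_text):
    # the context goes in the <|startofprev|> section, without special tokens
    context_ids = tokenizer(" " + context_text.strip(), add_special_tokens=False).input_ids
    # the target gets the usual <|startoftranscript|> ... <|endoftext|> wrapping
    target_ids = tokenizer(" " + target_text.strip()).input_ids
    return [prev_token_id] + context_ids + target_ids

labels = build_labels("This is the first sentence.", "This is the second sentence.")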

@connor-henderson connor-henderson deleted the whisper-prompting branch May 22, 2023 17:55
@dgram0

dgram0 commented May 23, 2023

> Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.

I'll try reproducing in a small toy example. It's reproducible on my side with the fine-tuned large private model I've been working with.

@dgram0

dgram0 commented May 23, 2023

> Do you have a reproducible example for this @dgram0? That seems like a serious enough bug that needs investigating further.

The following triggers the bug on the 13th iteration of the loop. (Usually, it takes a lot more iterations.)

from datasets import load_dataset, DatasetDict
from transformers import WhisperForConditionalGeneration, WhisperProcessor

it = iter(load_dataset("librispeech_asr", "all", split="test.other", streaming=True))
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="English", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
prompt = 'some text rich in domain specific vocabulary lives here'
past_prompts = ["I am from the cutter lying off the coast"]
while it:
  _ = [next(it) for x in range(3)]
  clip = next(it)
  input_features = processor(clip['audio']['array'], sampling_rate=clip['audio']['sampling_rate'], return_tensors="pt").input_features
  prompt_ids = processor.get_prompt_ids(prompt + ' - ' + ' - '.join(past_prompts))
  pred_ids = model.generate(input_features, language="english", task="transcribe", max_new_tokens=128, prompt_ids=prompt_ids)
  result = processor.batch_decode(pred_ids, skip_special_tokens=True)[0].strip()
  result_text = result.removesuffix('.')
  print(result_text)
  if result_text != '':
    past_prompts.append(result_text)
    if len(past_prompts) > 12:
      past_prompts = past_prompts[1:]

@connor-henderson
Contributor Author

@dgram0 thanks for sharing, I was able to repro this. As far as its relation to prompting goes, I think this is another case of prompt sensitivity as opposed to a bug, but it may still be of interest with regards to Whisper generally since it's the same error message as issue #22682.

I noticed that joining the prompts by ' - ' was causing the model to start predicting Chinese characters, and using '. ' instead did not lead to the error (at least through 30 loops, at which point I stopped testing). I did notice degraded predictions over time though, since a period did not necessarily belong after each result, and every now and again a Chinese char was still predicted, so I'd just be cautious about how prompts are chained together.

@dgram0

dgram0 commented May 23, 2023

@connor-henderson It's a bit of a contrived example meant just to recreate the issue without having to loop too much and at the same time show what may be considered a normal use case. Even without it predicting non-English characters or words you'll eventually encounter the issue within a few hundred loops.

@dgram0

dgram0 commented May 23, 2023

> @dgram0 thanks for sharing, I was able to repro this. As far as its relation to prompting goes, I think this is another case of prompt sensitivity as opposed to a bug, but it may still be of interest with regards to Whisper generally since it's the same error message as issue #22682.
>
> I noticed that joining the prompts by ' - ' was causing the model to start predicting Chinese characters, and using '. ' instead did not lead to the error (at least through 30 loops, at which point I stopped testing). I did notice degraded predictions over time though, since a period did not necessarily belong after each result, and every now and again a Chinese char was still predicted, so I'd just be cautious about how prompts are chained together.

The following still joins the prompts using ' - ', doesn't allow non-English characters in the prompts, doesn't seem to predict Chinese characters, does a decent job of transcription, and still fails on the 144th loop.

from datasets import load_dataset, DatasetDict
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import re
import torch

it = iter(load_dataset("librispeech_asr", "all", split="test.other", streaming=True))
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="English", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
_ = model.to(device)
prompt = 'Some text rich in domain specific vocabulary and example format lives here.'
past_prompts = ["I am from the cutter lying off the coast."]
while it:
  clip = next(it)
  input_features = processor(clip['audio']['array'], sampling_rate=clip['audio']['sampling_rate'], return_tensors="pt").input_features
  prompt_ids = processor.get_prompt_ids(prompt + ' - ' + ' - '.join(past_prompts))
  if device.type == 'cuda':
    input_features = input_features.cuda()
  pred_ids = model.generate(input_features, language="english", task="transcribe", max_new_tokens=128, prompt_ids=prompt_ids)
  result = processor.batch_decode(pred_ids, skip_special_tokens=True)[0].strip()
  result_text = re.sub(r"[^\u0000-\u05C0\u2100-\u214F]+$", "", result)
  print(result)
  if result_text != '':
    past_prompts.append(result_text)
    if len(past_prompts) > 12:
      past_prompts = past_prompts[1:]

@connor-henderson
Contributor Author

Thanks @dgram0, in that case I think this is a bug. I opened issue #23723 and PR #23724 for both this and another bug this made me realize exists, where max_new_tokens isn't properly enforced when the prompt_ids length is too large. I think they both have the same root cause.

@hollance
Contributor

hollance commented May 24, 2023

Thanks, @dgram0. Would you have time to look at this bug @connor-henderson, since you're most familiar with this code? If not, I can have a look.

EDIT: LOL, I'm way too slow. Should probably refresh my browser before commenting. Thanks for making these new issues, Connor. 😄

@hollance
Contributor

@connor-henderson @sanchit-gandhi Hey, did we ever resolve the add_prefix_space issue?

If I do the following,

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-tiny")
prompt_ids = pipe.tokenizer.get_prompt_ids("Hello, world!", return_tensors="pt")

I get the error,

TypeError: _batch_encode_plus() got an unexpected keyword argument 'add_prefix_space'

It works fine if I create a processor or tokenizer object by hand and call get_prompt_ids().

I seem to recall this issue came up before but not sure if anything was decided for it?

@connor-henderson
Contributor Author

@hollance @versae I missed that, just looked into it. It appears to be a difference between the slow tokenizer accepting add_prefix_space and the fast tokenizer not recognizing or applying it; opened an issue here: #23764

gojiteji pushed a commit to gojiteji/transformers that referenced this pull request Jun 5, 2023
* initial working additions

* clean and rename, add cond stripping initial prompt to decode

* cleanup, edit create_initial_prompt_ids, add tests

* repo consistency, flip order of conditional

* fix error, move the processor fn to the tokenizer

* repo consistency, update test ids to corresponding tokenizer

* use convert_tokens_to_ids not get_vocab...

* use actual conditional in generate

* make sytle

* initial address comments

* initial working add new params to pipeline

* first draft of sequential generation for condition_on_previous_text

* add/update tests, make compatible with timestamps

* make compatible with diff. input kwargs and max length

* add None check

* add temperature check

* flip temp check operand

* refocusing to prev pr scope

* remove the params too

* make style

* edits, move max length incorporating prompt to whisper

* address comments

* remove asr pipeline prompt decoding, fix indexing

* address comments (more tests, validate prompt)

* un-comment out tests (from debug)

* remove old comment

* address comments

* fix typo

* remove timestamp token from test

* make style

* cleanup

* copy method to fast tokenizer, set max_new_tokens for test

* prompt_ids type just pt

* address Amy's comments

* make style
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023

forced_decoder_ids = [(1, 6), (2, 7), (3, 8)]

output = model.generate(
input_features, max_new_tokens=5, forced_decoder_ids=forced_decoder_ids, prompt_ids=prompt_ids
Contributor

Why do we allow passing prompt_ids as a numpy array here?

task=None,
language=None,
is_multilingual=None,
prompt_ids: Optional[torch.Tensor] = None,
Contributor

I think prompt_ids should not be allowed to be a numpy array given its signature (see: https://github.com/huggingface/transformers/pull/22496/files#r1467369773)

mattt pushed a commit to mattt/transformers that referenced this pull request May 9, 2024
